Abstract

The problem of finding all occurrences of a pattern of length m in a text of length n is
considered. It is shown that the Boyer-Moore pattern matching algorithm performs roughly
3n comparisons and that this bound is tight up to O(n/m): an upper bound of slightly fewer
than 3n comparisons is shown, as is a lower bound of 3n(1 − o(1)) comparisons, as n/m → ∞
and m → ∞. While the upper bound is somewhat involved, its main elements provide a
quite simple proof of a 4n upper bound for the same algorithm.

1      Introduction 

Pattern  matching  is  the  problem  of  finding  a  pattern  of  length  m  in  a  string  of  length  n; 
often, all occurrences of the pattern are sought. This problem is well studied and is a staple of
textbooks on algorithms (for instance, [AHU73,BB88]). It is an important subproblem in a
number  of  domains  including  text  editing,  symbol  manipulation  and  data  retrieval. 

The  best  known  algorithms  for  this  problem  are  the  Knuth-Morris-Pratt  algorithm  [KMP77] 
and the Boyer-Moore algorithm [BM77] (for short, we refer to these as the KMP and BM
algorithms, respectively). Both these algorithms are linear time; the bound for the KMP algorithm
is very straightforward, the bound for the BM algorithm, considerably less so. An interesting
aspect  of  the  BM  algorithm  is  that  on  average  (in  probabilistic  settings)  it  takes  sublinear  time; 
this  effect  is  observed  in  practice  too.  A  recent  study  of  this  behavior  is  given  in  [BGR90]. 

Both  the  KMP  and  BM  algorithms  begin  by  computing  a  shift  function  (we  review  the 
shift  function  for  the  BM  algorithm  later).  Following  the  precomputation  of  the  shift  function, 
the actual match is carried out. The complexity of the algorithm is usually stated in terms
of  the  number  of  comparisons  required  for  the  matching  stage  (excluding  the  precomputation 
stage).  The  KMP  algorithm  requires  2n  -  m  comparisons  in  the  worst  case  and  this  is  a  tight 
bound  for  m  >  2.  For  the  Boyer-Moore  algorithm,  the  first  linear  bound  was  given  by  Knuth 
in [KMP77], a bound of 7n comparisons; this proof is difficult. In 1980, Guibas and Odlyzko
gave  a  less  complex  proof  [GO80],  obtaining  a  bound  of  4n  comparisons;  still,  their  proof, 
some  eight  journal  pages,  is  not  easy  reading.  Guibas  and  Odlyzko  also  conjectured  that  2n 
comparisons  might  be  the  correct  bound.  Our  contribution  is  twofold. 

•  We  give  a  bound,  tight  up  to  lower  order  terms,  of  roughly  3n  comparisons,  thereby 
disproving  the  just  mentioned  conjecture. 


* The work was supported in part by NSF grants CCR-8902221 and CCR-8906949.


•   In addition, the basic elements of this proof provide a direct and straightforward
demonstration of an upper bound of 4n comparisons. This proof is some three pages long.

The above bounds for the BM algorithm assume that the pattern is not semi-cyclic, i.e., of
the form $wv^k$, where w is a proper suffix of v and k ≥ 2. Galil [Ga79] showed how to modify
the  BM  algorithm  so  that  a  linear  bound  applies  in  this  case  too;  in  fact,  using  essentially 
Galil's  modification,  our  bounds  apply  unchanged  to  such  patterns. 

Another approach to improving the bound on the number of comparisons for the
Boyer-Moore algorithm is to modify the algorithm so that it no longer necessarily compares characters
of  the  pattern  in  consecutive  right  to  left  order.  Such  a  modification  was  given  in  [AG86];  they 
thereby  obtain  a  bound  of  2n  —  m  +  1  character  comparisons,  at  the  cost  of  a  more  complex 
control  structure  (this  algorithm  remembers  all  characters  of  the  text  that  have  been  matched 
and  thereby  avoids  comparing  text  characters  successfully  more  than  once). 

Many other types of pattern matching algorithms have been studied; these include
matching several strings simultaneously [AC75], real-time matching [Ga81], matching in constant
space [GS83,CP89], randomized algorithms [KR87], parallel algorithms [Vi85,Vi90],
approximate matching [LV89,GP89,Uk85] and two-dimensional matching [ALV90].

The remainder of the paper is organized as follows. In Section 2, we briefly review the
Boyer-Moore algorithm. In Section 3, we prove the 4n upper bound for non-semi-cyclic
patterns; in Section 5, we extend this upper bound to an upper bound of roughly 3n comparisons.
In  Section  4,  we  show  the  lower  bound  of  roughly  3n  comparisons.  In  Section  6,  we  extend 
the  results  to  semi-cyclic  patterns.  Finally,  for  the  sake  of  completeness,  in  the  Appendix,  we 
describe  how  to  compute  the  shift  functions. 

2      The  Boyer-Moore  Algorithm:  A  Review 

To find occurrences of the pattern in the text, the BM algorithm tests whether given substrings,
t',  of  the  text,  of  length  m,  match  the  pattern;  each  such  test  is  called  an  attempted  match. 
An  attempted  match  is  performed  as  follows:  The  characters  of  the  pattern  are  compared,  one 
by  one,  in  right  to  left  order,  with  the  corresponding  characters  of  substring  t',  until  either  a 
mismatch  is  found  or  the  match  is  complete.  When  an  attempted  match  completes  (either  by 
a  mismatch  or  by  finding  a  match)  the  pattern  is  shifted  to  the  right  by  the  maximum  distance 
consistent  with  not  missing  any  potential  matches.  This  shift  is  determined  by  a  shift  function, 
described in the next paragraph; in fact, the actual shift may be smaller than this maximum.
Following  the  shift,  another  attempted  match  is  performed.  This  procedure  is  continued  until 
the  pattern  is  shifted  beyond  the  right  end  of  the  text.  The  initial  attempted  match  is  with 
the  leftmost  m  characters  of  the  text. 

The shift is determined by two shift functions. If there is a mismatch, the first shift
function, the occurrence shift, implicitly provides the location of the rightmost character, c, in
the pattern, if any, that matches the mismatched character in the text. If c is present, the shift
specified by the occurrence shift would cause c to become aligned with the mismatched text
character; while if c is not present, the shift aligns the leftmost character of the pattern with
the text character immediately to the right of the mismatched text character. The second shift
function, the matching shift, is illustrated in Figure 1; it specifies the smallest shift such that
the shifted pattern matches the unshifted pattern on all the characters that were successfully
matched, and fails to match the unshifted pattern on the mismatched character; if there is
no such shift, then the smallest shift that causes a prefix of the shifted pattern to match a
suffix of the matched characters is specified; if there is no shift of this type either, then a
shift of length m is specified. It is usual to take the maximum of the occurrence and matching
shifts as the shift for the Boyer-Moore algorithm. In fact, one could use the matching shift
alone. Whichever is done, our upper and lower bounds apply. The presence of the occurrence
shift is desirable, however, since it helps assure a sublinear behaviour in practice. The only
difficulty is that the proof of the upper bound becomes more involved, although the essentials
are unchanged.
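The two shift rules can be sketched directly from their definitions. The following is a minimal illustration, not the precomputation described in the Appendix: it recomputes each shift by brute force, and the function names, the 0-based mismatch index i, and the convention i = -1 for a complete match are assumptions made for this sketch.

```python
def occurrence_shift(p, i, c):
    """Shift that aligns the rightmost occurrence of text character c in p
    with the mismatched text position (pattern index i).
    If c does not occur in p, rfind gives -1 and the shift is i + 1, which
    places the left end of p just right of the mismatched text character."""
    return i - p.rfind(c)  # may be zero or negative; the caller takes a maximum


def matching_shift(p, i):
    """Smallest shift s >= 1 such that the shifted pattern agrees with the
    matched suffix p[i+1:] wherever the two overlap, and does not bring the
    same character back to the mismatch position i (i = -1: complete match)."""
    m = len(p)
    for s in range(1, m + 1):
        agrees = all(q - s < 0 or p[q - s] == p[q] for q in range(i + 1, m))
        differs = i < 0 or i - s < 0 or p[i - s] != p[i]
        if agrees and differs:
            return s
    return m  # not reached: s = m always satisfies both conditions


def bm_shift(p, i, c):
    """Usual Boyer-Moore choice: the larger of the two shifts."""
    return max(matching_shift(p, i), occurrence_shift(p, i, c))
```

For example, with p = "aabaa" and a mismatch at index 3 against the text character "b", both rules give a shift of 1; against a character that does not occur in p, the occurrence shift gives 4.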

Figure 1: The matching shift (rows: text, pattern, shifted pattern; the mismatch position is marked)

3      A  Simple  Upper  Bound:  4n  Comparisons 

The analysis uses a simple potential function:

3 · #(positions not yet shifted over) + #(unread text characters).

An unread text character is one that has not been involved in a comparison so far. Each
shift and the comparisons performed in the associated attempted match are shown to have
an amortized cost of at most zero. The initial potential is 4n (a value of 4n − 3m might be
expected, but we have to allow for the final shift, which could move the pattern up to m
positions to the right of the text).
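The way such a potential function yields a bound on the total number of comparisons can be written out as a short telescoping sum. The symbols $c_i$ (comparisons in the i-th attempted match) and $\Phi_i$ (potential after its shift) are notation introduced only for this sketch, and the last step assumes the potential never becomes negative, which is what the allowance for the final shift is meant to provide.

```latex
% c_i: comparisons in the i-th attempted match; \Phi_i: potential after its shift.
\begin{align*}
  c_i - (\Phi_{i-1} - \Phi_i) &\le 0
      \quad\text{(amortized cost of attempted match } i\text{)},\\
  \sum_i c_i &\le \sum_i (\Phi_{i-1} - \Phi_i)
      = \Phi_0 - \Phi_{\mathrm{final}} \le \Phi_0 = 4n .
\end{align*}
```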

Before  analyzing  the  possible  shifts,  we  need  a  definition  and  a  few  results. 
Definition. A string u is semi-cyclic if it can be written in the form $wv^k$, where w is a proper
suffix of v and k ≥ 2. If w = ε, u is said to be cyclic. Also, u is said to be, respectively,
semi-cyclic, cyclic in v. It is convenient to extend this terminology to allow the wording u is
semi-cyclic (resp. cyclic) in v to include the case k = 1; no ambiguity will result.
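The definition can be checked mechanically. A string has the form $wv^k$ with w a proper suffix of v and k ≥ 2 exactly when it has a period d = |v| with 2d ≤ |u|, so a brute-force test over candidate periods suffices. The function name below is hypothetical, and the test is written for clarity rather than efficiency.

```python
def is_semi_cyclic(u):
    """True if u = w v^k with w a proper suffix of v and k >= 2.
    Equivalent formulation used here: u has some period d with 2*d <= len(u),
    i.e. u[i] == u[i + d] for every valid i."""
    n = len(u)
    return any(
        all(u[i] == u[i + d] for i in range(n - d))
        for d in range(1, n // 2 + 1)
    )
```

For instance, "babab" is semi-cyclic (w = "b", v = "ab", k = 2), while "aabaa", the pattern used in Figure 5, is not.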

Lemma  1  Let  x  and  y  be  two  non-empty  strings.  If  xy  =  yx  then  there  is  a  string  z  such 
that  both  x  and  y  are  cyclic  in  z. 

Proof. The proof is by induction on |x| + |y|. If |x| = |y|, then take z = x (= y); the result
follows. Otherwise, without loss of generality, suppose that |x| < |y|. Then x is a prefix of y,
so we can write $y = xy_1$. Note that $y_1 \ne \varepsilon$. Substituting gives $xxy_1 = xy_1x$; i.e., $xy_1 = y_1x$.
The result now follows by induction. •
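The induction in this proof is constructive and mirrors Euclid's algorithm on the lengths. A minimal sketch, with a hypothetical function name, that returns a string z in which both x and y are cyclic (in the extended sense that allows a single copy):

```python
def common_root(x, y):
    """Given non-empty strings with x + y == y + x, return a string z such
    that both x and y are powers of z, following the induction of Lemma 1."""
    assert x and y and x + y == y + x
    if len(x) == len(y):
        return x                      # base case: x == y, take z = x
    if len(x) > len(y):
        x, y = y, x                   # without loss of generality |x| < |y|
    # x is a prefix of y, so y = x + y1 with y1 non-empty and x + y1 == y1 + x
    return common_root(x, y[len(x):])
```

The string returned has length gcd(|x|, |y|); it is not necessarily the shortest such z.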

Corollary 1   Suppose that v is a proper cyclic shift of w. If v = w, v is cyclic.


Proof. Since v is a proper cyclic shift of w, we can write w = xy and v = yx, where x, y ≠ ε.
By Lemma 1, there is a string z with $x = z^i$ and $y = z^j$, for some i, j ≥ 1. So $v = z^k$, for some
k ≥ 2; i.e., v is cyclic. •

A few more definitions are helpful. right(v) denotes the rightmost character of v. rightmost(v, u)
denotes the rightmost substring of u equal to v. left(v) and leftmost(v, u) are defined
analogously.

To begin with, we suppose that only the matching shift is used; subsequently we remove
this assumption. Recall that we are assuming the pattern is not semi-cyclic.

Suppose that an unsuccessful attempted match, called the current attempted match, causes
a shift by distance s. Let u be the suffix of the pattern of length s. Suppose that $u = v^k$,
k ≥ 1, where v is not cyclic. Let t be the portion of text matched in the current attempted
match. Clearly, if |t| ≤ 3s, the current attempted match has amortized cost at most 0 (since
the rightmost character of t was unread prior to the current attempted match). So suppose
that |t| > 3s. Then t must be semi-cyclic in u and hence in v. Our goal is to show that prior
to the current attempted match only the following characters of t can have been read: The
leftmost |v| − 1 characters, and the rightmost 2|v| − 1, excluding the rightmost character itself
(see Figure 2).


text  h^^r^ 1  ^w'^^^  I 

\v\  -  1 

key:         ^/  only  characters  that  might 

be  read  prior  to  current 
attempted  match 

Figure  2:  Characters  already  read 

An  attempted  match  AM,  which  precedes  the  current  attempted  match,  and  begins  by 
comparing  a  character  of  t,  is  called  an  early  attempted  match  (with  respect  to  the  current 
attempted  match). 

Lemma 2  Let AM be an early attempted match. Then right(p) is not aligned with right(v)
for any substring v in t.

Proof. See Figure 3. For suppose that the pattern were so aligned. Then a mismatch occurs
immediately to the left of left(t) and right(p) would be shifted to distance s to the right of
right(t), contradicting the fact that AM is an early attempted match. (For any shorter shift
by a multiple of |v| would place the same character in the mismatch location; any other shorter
shift causes a proper cyclic shift of v (in p) to be aligned with an instance of v in t, which
cannot occur by Corollary 1, since v is not cyclic.) •

Lemma 3  Let AM be an early attempted match. Then AM performs at most |v| comparisons
with characters of t. Further, if there are |v| such comparisons, the last comparison is a
mismatch.

Proof. If |v| characters of t were matched, by Corollary 1, v would be cyclic (since, by Lemma
2, right(p) is not aligned with right(v) for any substring v in t). •


Figure 3: Proof of Lemma 2 (rows: text, pattern, resulting shifted pattern; the mismatch is marked)


Lemma 4  Let AM be an early attempted match. Then right(p) is either aligned with a character
in rightmost(v, t) or with one of the leftmost |v| − 1 characters in t.

Proof. See Figure 4. For if right(p) is elsewhere, the first |v| comparisons, if there were that
many, would be with characters in t. Then, by Lemma 3, there must be a mismatch on or
before the |v|th comparison; i.e., there are at most |v| comparisons. Consider the occurrence
of v in t containing the character with which right(p) is aligned. The shift moves right(p)
at most to the right end of this occurrence (for this shift would produce a match with all the
text characters compared in the attempted match and so certainly suffices). In fact, if right(p)
is shifted less far, eventually, following a sequence of such attempted matches, right(p) is
aligned with the right end of this occurrence of v. But this contradicts Lemma 2. •


Figure 4: Proof of Lemma 4 (rows: text, pattern, possible shifted pattern; the mismatch is marked)

We now conclude:

Lemma 5   Prior to the current attempted match, at most 3|v| − 3 characters of t have been
read.

Proof. See Figure 2. By Lemma 4, in the early attempted matches, right(p) is either aligned
with one of the rightmost |v| characters of t or with one of the leftmost |v| − 1 characters of t.
By Lemma 3, each of the former attempted matches performs at most |v| comparisons. Since
right(t) is unread prior to the current attempted match, the lemma follows. •

We can now bound the amortized cost of the current attempted match.


Lemma 6    The current attempted match has amortized cost at most zero.

Proof. Let s be the distance shifted. The number of comparisons performed is |t| + 1. If
3s ≥ |t|, then the potential is reduced by at least 3s + 1 ≥ |t| + 1 (since the rightmost character
matched had been unread); the result follows in this case.

So suppose that 3s < |t|. Then by Lemma 5, the number of characters of t read is at most
3|v| − 3 ≤ 3s − 3. So the reduction in potential is at least 3s + (|t| − 3s + 3) = |t| + 3. The
result follows in this case also. •

It remains to consider the case that an attempted match matches the complete pattern.
By assumption, the pattern is not semi-cyclic; so the shift must be of length at least m/2; it
follows that this attempted match also has amortized cost at most zero.

We  have  shown: 

Theorem 1  If the Boyer-Moore pattern matching algorithm determines its shifts using only the
matching shift rule, then it performs at most 4n comparisons when matching a non-semi-cyclic
pattern of length m against a text of length n.

Now,  we  extend  the  result  to  allow  the  use  of  both  the  occurrence  and  matching  shifts  in 
the  case  of  non-semi-cyclic  patterns.  Consider  the  proof  of  Lemma  6.  We  redefine  s  to  be 
the  length  of  the  shift  determined  by  the  matching  shift  (which  is  either  equal  to  or  smaller 
than  the  length  of  the  actual  shift).  In  order  for  Lemma  6  to  continue  to  hold,  it  suffices  that 
Lemma  5  remain  true  in  the  presence  of  the  occurrence  shift.  But  it  is  a  simple  matter  to 
check  that  Lemmas  2-5  continue  to  hold  in  the  presence  of  the  occurrence  shift.  Thus: 

Theorem  2  The  Boyer-Moore  pattern  matching  algorithm  performs  at  most  4n  comparisons 
when  matching  a  non-semi-cyclic  pattern  of  length  m  against  a  text  of  length  n. 

4     The  Lower  Bound 

We give example patterns, p, of length m = 2k − 1, and texts t, which demonstrate the lower
bound of 3n(1 − o(1)) comparisons (as n/m → ∞ and m → ∞).

The basic idea is to have frequent shifts by k on attempted matches comprising 2k − 1
comparisons. In addition, in order to have roughly 3 comparisons per text character, we also
seek to have short shifts on attempted matches of roughly k comparisons.

In particular, we choose $p = a^{k-1} b a^{k-1}$ and $t = a^{k-1} (a b a^{k-1})^{\lambda}$. Let $v' = a b a^{k-1}$.

Consider a situation in which the b in the pattern is aligned with the left end of some
occurrence of v' in the text. (See Figure 5.) The attempted match will be of length k − 1 and
it will cause a shift of length 1. The next attempted match will have right(p) aligned with the
right end of this occurrence of v'. The attempted match will be of length 2k − 1 and it will
result in a shift of length k. We return to the first situation.

We  note  that  t  has  been  chosen  so  that  the  initial  situation  is  the  first  situation.  We  have 
shown: 

Theorem 3  There exist patterns of length m = 2k − 1 and texts of length n = λ(k + 1) + (k − 1),
for any integers k ≥ 2 and λ ≥ 1, for which the Boyer-Moore pattern matching algorithm
performs $\frac{3k-2}{k+1}(n - k + 1) = \left(n - \frac{m-1}{2}\right)\left(3 - \frac{10}{m+3}\right)$ comparisons. This is 3n(1 − o(1)) comparisons
as n/m → ∞ and m → ∞.

For even length patterns, with $p = a^{k-2} b a^{k-1}$, for k ≥ 2, and t one character shorter at
the left end, we obtain a similar bound of $\frac{3k-3}{k+1}(n - k + 2) = \left(n - \frac{m-2}{2}\right)\left(3 - \frac{12}{m+4}\right)$ comparisons.
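The count for the odd-length construction can be reproduced by simulation. The sketch below is an illustration under stated assumptions: it uses the matching shift alone (recomputed by brute force rather than precomputed), and the function names and the instance generator are hypothetical. It compares the simulated number of comparisons with λ(3k − 2), which equals the closed form (n − (m − 1)/2)(3 − 10/(m + 3)).

```python
from fractions import Fraction


def matching_shift(p, i):
    """Brute-force matching shift after a mismatch at pattern index i
    (i = -1 denotes a complete match)."""
    m = len(p)
    for s in range(1, m + 1):
        agrees = all(q - s < 0 or p[q - s] == p[q] for q in range(i + 1, m))
        differs = i < 0 or i - s < 0 or p[i - s] != p[i]
        if agrees and differs:
            return s
    return m


def count_comparisons(p, t):
    """Run right-to-left attempted matches and count character comparisons."""
    m, n = len(p), len(t)
    pos = comparisons = 0
    while pos + m <= n:
        i = m - 1
        while i >= 0:
            comparisons += 1
            if p[i] != t[pos + i]:
                break
            i -= 1
        pos += matching_shift(p, i)
    return comparisons


def lower_bound_instance(k, lam):
    """Pattern a^(k-1) b a^(k-1) and text a^(k-1) (a b a^(k-1))^lam."""
    p = "a" * (k - 1) + "b" + "a" * (k - 1)
    t = "a" * (k - 1) + ("ab" + "a" * (k - 1)) * lam
    return p, t


for k, lam in [(2, 5), (3, 4), (10, 50)]:
    p, t = lower_bound_instance(k, lam)
    m, n = len(p), len(t)
    closed_form = (n - Fraction(m - 1, 2)) * (3 - Fraction(10, m + 3))
    assert count_comparisons(p, t) == lam * (3 * k - 2) == closed_form
```

With k = 3 and λ = 4 the text has length 18 and the simulation performs 28 comparisons, alternating attempted matches of 2 and 5 comparisons as in Figure 5.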
text          ...abaaabaaabaa...
pattern            aabaa          mismatch
first shift         aabaa         match
second shift           aabaa      iterate

Figure 5: The lower bound for k = 3

5      An  Upper  Bound  of  Roughly  3n  Comparisons 

A  more  careful  analysis,  in  the  style  of  Section  3,  yields  an  upper  bound  of  slightly  fewer  than 
3n  comparisons.  We  use  the  following  potential  function: 

2 · #(positions not yet shifted over) + #(unmarked characters).

We show that each attempted match has an amortized cost of at most −1. Initially, all the
characters of the text are unmarked. The unmarked characters are treated in the same way
as unread characters in Section 3. When an unmarked character is read it becomes marked.
However, a marked character can be unmarked anew, as follows: Whenever an attempted
match would have an amortized cost less than −2, the leftmost characters compared in the
current attempted match are unmarked so as to increase the amortized cost to −2 (or until all
the characters read in the current attempted match are unmarked, whichever occurs sooner);
actually, as we see below, there is a special case for shifts of length 1.

As  in  Section  3,  we  begin  by  considering  the  variant  of  the  Boyer-Moore  algorithm  in 
which only the matching shift rule is used. Also, we are still assuming that the pattern is not
semi-cyclic. 

As in Section 3, we consider a current attempted match which matches a substring t of the
text, and mismatches immediately to the left of left(t). We also consider earlier attempted
matches in which right(p) is aligned with a character of t. Let s be the distance shifted in the
current attempted match. There are a number of cases to consider (in all the cases apart from
Case 1 and Case 3.3.1 we show an amortized cost of at most −2 for the current attempted
match, and for Case 3.3.1 this bound can be achieved by a more careful and longer argument).

Case 1. s = 1. Let t' be the text read by the current attempted match. Then $t' = ba^j$ for
some j ≥ 0, where a ≠ b. We claim that the characters of t' had not been read previously. This
would give an overall amortized cost of −2 for the attempted match, except that we unmark
the rightmost character of t', giving an amortized cost of −1. It remains to demonstrate the
claim. If there had been an earlier attempted match with right(p) aligned with some character
of t', then the mismatch would have been at character b of t', and the resulting shift would
have moved right(p) beyond right(t'); so there was no such earlier attempted match; i.e., all
of t' was unmarked immediately prior to the current attempted match.

Remark  1.  Henceforth,  we  can  assume  that  at  each  attempted  match  the  rightmost  two 
characters  compared  (if  at  least  two  are  compared)  are  unmarked. 

Case 2. |t| < 2s. The decrease in potential is at least 2s + 2. The cost of the current attempted
match is |t| + 1. So the amortized cost of the current attempted match is at most −2.

Case 3. |t| ≥ 2s and s > 1. Then $t = wv^k$, for some k ≥ 2, where w is a proper suffix of v, v
is non-cyclic, and s is an integer multiple of |v|. Lemmas 2-4 continue to hold. A little more
notation is helpful. Let $w_L$, $v_L$, $v_{R1}$, $v_{R2}$ denote, respectively, leftmost(w, t), leftmost(v, t),
rightmost(v, t), and the second rightmost v in t. See Figure 6.


Figure 6: Further Notation

Case 3.0. There is no early attempted match. Then the current attempted match has
amortized cost at most −2s + 1 ≤ −3 < −2.

Case 3.1. $AM_1$, the first early attempted match, has right(p) aligned with a character of $v_{R1}$
other than right($v_{R1}$). Then at most 2|v| − 3 characters of t will have been read and remain
marked prior to the current attempted match (for among the characters of $v_{R1}$ the rightmost
two are unmarked, and by Lemma 3, besides characters of $v_{R1}$, only the |v| − 1 rightmost
characters of $v_{R2}$ can have been read). So the amortized cost of the current match is at most
−2.

Before  considering  the  next  case,  we  prove  a  lemma. 

Lemma 7   Consider an early attempted match, $AM_1$, which results in a shift of length $s_1$.

(i)   If $AM_1$ matches at least |v| characters, then $s_1$ is an integer multiple of |v|. Also, the
number of characters matched by $AM_1$ is exactly $|t| + s - s_1$.

(ii)    While if $AM_1$ matches fewer than |v| characters, then $s_1 < |v|$.

Proof. First note that the shift does not move right(p) to the right of right(t) (since this is
an early attempted match). Thus $s_1 < |wv^k|$.

Now, we prove Part (i). Let $\bar{v}$ be the string in the text which matches rightmost(v, p)
during attempted match $AM_1$. Following the shift, the substring $\tilde{v}$ of p aligned with $\bar{v}$ must
be the string v; further, $\tilde{v}$ is part of the suffix of p of length $|wv^{k+1}|$ (as $s_1 < |wv^k|$), which
suffix is semi-cyclic in v; thus $\tilde{v}$ is a cyclic shift of v; by Corollary 1, this cyclic shift must be
the trivial shift; i.e., $s_1$ is an integer multiple of |v|.

Suppose that the number of characters matched is greater than $|t| + s - s_1$. Let c be the
$(|t| + s - s_1 + 1)$th character of the text matched by $AM_1$. Then c will not match the character
of p with which it is aligned following the shift by distance $s_1$ (for this is the $(|t| + s + 1)$th
rightmost character of p, which differs from the characters l|v| positions to its right in p, for
any l ≥ 1). But this contradicts the definition of the matching shift. So suppose that the number
of characters matched is fewer than $|t| + s - s_1$. Then consider the character, c, of the text with
which $AM_1$ mismatches; the character of p aligned with c following the shift by $s_1$ is the same as
the character of p aligned with c before the shift. Again, this contradicts the definition of the
matching shift. Thus exactly $|t| + s - s_1$ characters are matched by $AM_1$.

We turn to Part (ii). Suppose that $s_1 > |v|$. Suppose c characters are compared by $AM_1$.
See Figure 7. Then a shift by $s_1 - |v|$ would also be a legal shift, in that the characters of
the pattern aligned with the c text characters compared by $AM_1$ are identical in a shift by $s_1$
and a shift by $s_1 - |v|$ (this follows from the fact that the portion of p semi-cyclic in v is of
length at least $s_1 + |v|$). So $s_1 \le |v|$. But a shift by |v| is not possible for this would replace
the mismatched character of the pattern by the same character, which is not a legal shift. So
$s_1 < |v|$. •

Figure 7: Proof of Lemma 7

Case 3.2. $AM_1$ is an early attempted match in which right(p) is aligned with a character in
$w_L$, which causes right(p) to be shifted distance $s_1$ so that it is aligned with a character in $v_{R1}$.
Note that $s_1 > |v|$ (for right(p) shifts over the whole of $v_L$). By Lemma 7, the shift has length
an integer multiple of |v|. Suppose that at the start of $AM_1$ right(p) is aligned with the rth
rightmost character of $w_L$. Then, following $AM_1$, right(p) is aligned with the rth rightmost
character of $v_{R1}$. By Lemma 3, at most |v| − 1 characters of the text immediately to the left of
this location can be read prior to the current attempted match. So the number of characters
of t that can be marked prior to the current attempted match is bounded as follows: At most
|w| − r + 1 characters of $w_L$, plus at most max{(r − 2) + (|v| − 1), 0} = (r − 2) + (|v| − 1)
characters at the right end of t. This is a total of |w| + |v| − 2 ≤ 2|v| − 3 characters. Hence
the current attempted match has amortized cost at most −2.

Case 3.3. $AM_1$ is an early attempted match in which right(p) is aligned with a character in
$v_L$, which causes a shift, by distance $s_1$, that aligns right(p) with a character in $v_{R1}$. By the
argument of the proof of Lemma 4, the characters matched by $AM_1$ include at least all the
characters up to left(t) (for if not the resulting shift would leave right(p) aligned with a
character of $v_L$). Let x be the suffix of p that overlaps $v_L$ prior to the attempted match $AM_1$,
and let y be the suffix that overlaps $v_{R1}$ after this attempted match. We consider two subcases
depending on the number of characters matched by $AM_1$.

Case 3.3.1. At least |v| characters are matched. See Figure 8. By Lemma 7, $s_1$ is an integer
multiple of |v|. So x = y. Consider the next attempted match, $AM_2$, also an early attempted
match. Its mismatch occurs at the (|wx| + 1)th character compared (for the (|wx| + 1)th
characters of the text compared by $AM_1$ and $AM_2$ are not equal, and as |wx| < |v| by Lemma
3 applied to $AM_1$, this is a character matched by $AM_1$). Suppose $AM_2$ results in a shift by
$s_2$. Let the suffix of p of length $s_2$ be z.

Suppose that |z| < |w|. Then $wx = z'z^l$, where z' is a proper suffix of z and l ≥ 1. Let $\bar{z}$
be the shortest non-cyclic string such that z is cyclic in $\bar{z}$ (possibly $\bar{z} = z$). x is cyclic in $\bar{z}$ (for
consider $\hat{z}$ = rightmost($\bar{z}$) in $v_{R2}$ and consider the substring, $\tilde{z}$, of p aligned with $\hat{z}$ following
the shift of $AM_1$; if x = y is not cyclic in $\bar{z}$, then $\tilde{z}$ is a proper cyclic shift of $\bar{z}$, which, by
Corollary 1, contradicts the minimality of $\bar{z}$). But the fact that the shift due to $AM_2$ causes
a match on the characters matched by $AM_2$ means that p has a suffix, semi-cyclic in $\bar{z}$, of
length |wxz|. Since v is also a suffix of p and |wx| + 1 ≤ |v|, the rightmost |wx| + 1 characters of v
are semi-cyclic in $\bar{z}$; so there is no mismatch by $AM_2$ at the location assumed (the (|wx| + 1)th
character compared by $AM_2$). This is a contradiction.

Figure 8: Case 3.3.1

So |z| ≥ |w|. Then, prior to unmarking read nodes, the amortized cost of $AM_2$ would
be −2|z| since all the characters compared were previously unread (as |w| + |x| < |v|). So
min{2|z| − 2, |wx| + 1} ≥ |w| characters are unmarked. Hence, following attempted match
$AM_2$, the |w| characters immediately to the right of leftmost(x, $v_{R2}$) are unmarked; by Lemma
3, these characters are not read anew prior to the current attempted match. So the number
of marked characters in t at the time of the current attempted match is bounded by: (|v| − 2)
characters in $v_{R1}$, (|v| − |w|) characters between $v_{R2}$ and $v_L$ (note that $v_L$ is not necessarily
equal to $v_{R2}$) and |w| characters in $w_L$; this is a total of at most 2|v| − 2 marked characters.
Hence the amortized cost of the current attempted match is at most −1.
Case 3.3.2. Fewer than |v| characters are matched. Let r be the number of characters
matched. Note that r > |w|.

First, suppose that $s_1 > r$. Following the shift due to $AM_1$, at least $s_1$ unmarked characters
can be created; so, in fact, all the characters read by $AM_1$ are unmarked. These unmarked
characters include $w_L$ plus the character immediately to the right of $w_L$. By Lemma 3, these
|w| + 1 characters are not reread until the current attempted match. So there are at most
2|v| − 3 marked characters in t at the time of the current attempted match, and hence the
amortized cost of the current attempted match is at most −2.

Second, suppose that $s_1 \le r$. We show this case cannot arise. So suppose this case occurs.
As $s_1 < |v|$, |y| < |x|. See Figure 9. Since a suffix of p, of length |vt|, is semi-cyclic in v,
and the shifted pattern matches the text characters matched by $AM_1$, y must be a prefix of
x. Next, we show that x is semi-cyclic in y. The portion, $\bar{x}$, of the shifted pattern, aligned
with the suffix x of p prior to the shift, must match the suffix x. Note that $\bar{x} = yv'$ where
v' is a prefix of v. Prior to the shift, left(x) is aligned with left($v_L$), so v' is a prefix of x
also. It follows that $x = y^i y'$, for some i ≥ 1, where y' is a proper prefix of y. As y is also a
suffix of x (since they are both suffixes of p), using Corollary 1, we conclude that there is a
non-cyclic string z such that both x and y are cyclic in z. We also conclude that the suffix of p
of length r is of the form $z'z^i$ for some i ≥ 1, where z' is a proper suffix of z. (For consider the
portion of the pattern, following $AM_1$, which is aligned with the rightmost r characters of the
pattern prior to $AM_1$; it comprises the prefix of v of length |x| − |y|, preceded on the left by
the suffix of v of length r − (|x| − |y|). Also, it matches the suffix of v of length r. As the prefix
of v of length |x| − |y| is cyclic in z the assertion follows.) Also, note that x is a prefix of v
(consider the attempted match $AM_1$ and the portion of t with which x is matched). To avoid
v being cyclic in z, we need that in $\bar{v}$ = rightmost(v, p), the suffix of length r overlap with
leftmost(x, $\bar{v}$) by fewer than |z| characters; i.e., r + |x| < |v| + |z|, so r < |v| + |y| − |x| = $s_1$;
this is a contradiction. So this case does not arise.
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Figure  9:  Case  3.3.2 

It  remains  to  consider  the  case  that  an  attempted  match  matches  the  complete  pattern. 
As  in  Section  3,  since  the  pattern  is  not  semi-cyclic,  the  shift  must  be  of  length  at  least  ^^^, 
and  so  such  an  attempted  match  has  amortized  cost  at  most  -3  <  -2. 

We  can  conclude: 

Theorem 4 The Boyer-Moore pattern matching algorithm, restricted to using only the matching
shift, performs at most 3n − n/m comparisons when matching a non-semi-cyclic pattern of
length m against a text of length n.

Proof. Let r (≤ m) be the distance finally shifted beyond the right end of the text. As
each shift has amortized cost at most −1, and as each shift is by distance at most m, the
amortized cost of the shifts in total is at most −(n − m + r)/m. Since the potential drops by at
most 3n − 2(m − r), the result follows. •
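
The arithmetic in the last step of the proof can be spelled out as follows. This is only a sketch, under the reading that at least (n − m + r)/m shifts occur (the right end of the pattern travels from position m to position n + r, and no shift exceeds m) and that the comparison count is bounded by the drop in potential plus the total amortized cost; it should be checked against the definitions given earlier in the paper.

```latex
\begin{align*}
\#\text{comparisons}
  &\le \underbrace{3n - 2(m - r)}_{\text{drop in potential}}
     \;-\; \underbrace{\frac{n - m + r}{m}}_{\text{one unit per shift}} \\
  &= 3n - \frac{n}{m} - (m - r)\Bigl(2 - \frac{1}{m}\Bigr) \\
  &\le 3n - \frac{n}{m}
     \qquad \text{since } r \le m \text{ and } m \ge 1 .
\end{align*}
```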

Improving the bound of Case 3.3.1 to an amortized cost of at most −2 improves the bound
of the theorem to 3n − 2n/m comparisons. Actually, this is still too large a bound, for it assumes
each shift is by distance m (which would result in fewer comparisons being made). Proving
a tight bound appears to involve a quite elaborate (and tedious) case analysis, which (in my
opinion) is not of sufficient interest to merit being detailed.

It remains to extend the result to incorporate the occurrence shift. This is not quite as
simple as in Section 3. We follow the analysis used earlier in this section and, as in Section 3,
s will denote the length of the matching shift. Cases 1, 2, 3.0 and 3.1 apply unchanged. Cases
3.2 and 3.3 require a new analysis. We introduce a new rule for unmarking nodes: When
the occurrence shift is used, suppose that the length of the occurrence shift exceeds that of the
matching shift by h; then the rightmost min{2h, |t| + 1} characters compared are unmarked.

Case 3.2. See Figure 10. A new case arises when AM1 uses the occurrence shift. Let x be the
suffix of w_L to the right of right(p) prior to attempted match AM1. Consider the first early
attempted match, AM2, subsequent to AM1, if any, which compares characters of v_R2 beyond
the rightmost |x| characters. If AM2 does not exist, then at most (|v| − 2) + |x| + max{0, |w| −
|x| − 2} < 2|v| − 4 characters of t will be marked prior to the current match (recall the at least
two characters unmarked due to the occurrence shift); while if AM2 has right(p) to the right
of the leftmost |w| − |x| characters in v_R1, then at most [|v| − (|w| − |x|) − 2] + (|v| − 1) +
max{0, |w| − |x| − 2} < 2|v| − 4 characters of t will be marked prior to the current match. In
these cases the current match has amortized cost at most −3 < −2.

Figure 10: Case 3.2 - the Heuristic Shift

So suppose that AM2 has right(p) aligned with one of the leftmost |w| − |x| characters in
v_R1. We first consider the case that x ≠ ε. Let x̄ be the shortest non-cyclic string such that
x is cyclic in x̄ (possibly x̄ = x). Since AM1 matches every character of w_L to the left of x,
w is semi-cyclic in x and thus in x̄. Let v_x̄ be the longest suffix of v such that v_x̄ = x' x̄^j for
some j ≥ 1, where x' is a proper suffix of x̄. As AM2 matches the rightmost |x| characters of
v_R2, and the corresponding matched characters of p are among p's rightmost |w| characters, it
must be that the substrings x̄ in v_R2 and p are aligned (or else Corollary 1 is contradicted); so
j ≥ 2 (for rightmost(x, v_R2) is matched, as is at least one string x̄ to its right). It follows that
exactly the suffix v_x̄ of p is matched. So the matching shift for AM2 has length greater than
|x' x̄^(j−1)|. The reduction in potential is at least the length of the suffix of v_R2 compared by AM2
minus |x| (previously unread text), plus twice the shift (≥ 2|x' x̄^(j−1)| + 2); this suffices to pay
for all the characters read plus the unmarking, in v_R2, of all but the rightmost |x| characters.
Also, following the shift by AM2, right(p) is to the right of the leftmost |w| − |x| characters
in v_R1. Now, by the argument of the last two sentences of the previous paragraph, it follows
that the current attempted match has amortized cost at most −2.

Now suppose that x = ε. We show that this case cannot occur. For if x = ε, the mismatch
in AM1 would occur immediately to the left of left(t). As the heuristic shift moves the
rightmost occurrence of a character to the mismatch location, since the character placed in the
mismatch location is from rightmost(t,p) it must be from rightmost(v,p); but this contradicts
the fact that the actual shift was of length greater than |v|.

Case 3.3.1. If AM1 uses the matching shift we argue as in the previous Case 3.3.1 (defining
s2 to be the shift given by the matching shift). So suppose that AM1 uses the occurrence shift.
Let x denote the overlap of p and v_L prior to attempted match AM1. Let x̄ be the shortest
non-cyclic string in which x is cyclic. Clearly w = x' x̄^i, for some i ≥ 0, where x' is a proper
suffix of x̄. Let v_x̄ denote the longest suffix of v semi-cyclic in x̄; |v_x̄| > |wx|. There are a
number of cases to consider.

First, suppose that the matching shift of AM1 would move right(p) to be aligned with
a character of v_R1. Since the matching shift, in this case, has length a multiple of |v| (by
Lemma 7), following the matching shift, p would overlap v_R1 by |x| characters. The occurrence
shift must result in a greater overlap (as it is the shift used). Consider the first subsequent
early attempted match, AM2, that compares a character of v_R2, if any. If AM2 does not
exist, or if, in AM2, right(p) is to the right of the leftmost |wx| characters of v_R1, then the
number of marked characters in t at the time of the current attempted match is at most
max{|v| − 2, (|v| − 2 − |wx|) + (|v| − 1)} + |w| + |x| ≤ 2|v| − 3, and the amortized cost of the
current attempted match is at most −2. So suppose that right(p) is aligned with one of the
leftmost |wx| characters in v_R1. There are two subcases to consider, depending on the length
of x.


Figure 11: Case 3.3.1 - the Heuristic Shift (the text w_L v_L v_R1, the overlap x, and the
pattern before and after AM2 applies the matching shift)

Suppose that |x| < |v|/2. See Figure 11. Let s2 be the length of the matching shift for AM2
and let z be the suffix of p of length s2. Let x' be the overlap of p beyond leftmost(x, v_R1)
at the start of attempted match AM2; by assumption, |x'| < |w|. Let x̄' be the shortest
non-cyclic string with x' cyclic in x̄'. x is cyclic in x̄' (this can be seen by an argument similar
to that used in the previous Case 3.3.1 to show x is cyclic in i). Thus w is semi-cyclic in x̄';
we conclude that AM2 mismatches at the (|wx| + 1)th character compared (for the longest
suffix of v semi-cyclic in x̄' is of length |wx|, as AM1 does not mismatch at the (|wx| + 1)th
character compared). By another argument similar to that of the previous Case 3.3.1, we
conclude that |w| < |z|. We want at least |w| + 1 characters of v_R2 to remain unmarked aside
from leftmost(x, v_R2), for then at most 2|v| − 3 characters of t will be marked at the time of
the current attempted match. Then, following the shift due to AM2, the potential at hand for
unmarking characters anew is at least 2|z| − |x| − |x'|; to obtain |w| + 1 unmarked characters
in v_R2 aside from leftmost(x, v_R2), we need

(|v| − |x|) − (|w| − |x'| + 1) + (2|z| − |x| − |x'|) ≥ |w| + 1

i.e., |v| − 2|x| + 2|z| ≥ 2|w| + 2, which is true when |x| < |v|/2 (as |z| > |w|).

So suppose that |x| ≥ |v|/2. Then both v and x are semi-cyclic in some non-cyclic y,
where |xy| < |v| (since x matches a prefix of v). Any early attempted match, subsequent
to AM1, which matches a character of v_R2 will have matched a suffix of p of length greater
than |x| (> |y|) and at least |x| characters of v_R1. Also, such an early attempted match can
compare at most |y| characters of v_R2 (for v is not cyclic in y, by assumption, and as at least
one character of v_R1 is compared, the "y" substrings in p and v_R2 are not aligned). The
resulting shift will be of length a multiple of |y|, as y is non-cyclic and v is semi-cyclic in y.
Furthermore, the resulting shift must align right(p) and right(t) (as this is the least shift by
a multiple of |y| that places a different character in the mismatch location). So the number of
marked characters for the current attempted match is at most |v| + (|w| + |x| − 2) ≤ 2|v| − 3
(the |v| term is for the characters compared in the attempted match immediately prior to the
current attempted match, and the other term is for the characters compared by AM1; the −2
term arises because at least two characters are unmarked due to the occurrence shift).
Second, suppose that the matching shift of AM1 does not align right(p) with a character
of v_R1. Then, if |x| < |v|/2, the length of the occurrence shift exceeds that of the matching shift
by at least |v|/2; the extra (at least) |v| decrease in potential is used to unmark leftmost(wx,t),
which then results in at most 2|v| − 3 characters being marked for the current attempted
match. So suppose that |x| ≥ |v|/2. Then x and v are semi-cyclic in some non-cyclic y, where
|xy| < |v|. Hence any subsequent early attempted match which compares characters of v_R2
can compare at most |y| such characters (for v is not cyclic in y, by assumption, and as at least
one character of v_R1 is compared, the "y" substrings in p and v_R2 are not aligned). So none of
the characters of leftmost(x, v_R2) are read anew prior to the current attempted match. The
occurrence shift exceeded the length of the matching shift by at least |y| + 1, hence at least
the |y| + 1 rightmost characters matched by AM1 are unmarked. Further, as just noted, these
characters are not read anew prior to the current attempted match. So the number of marked
characters at the time of the current attempted match is at most

(|v| − 2) + |y| + (|w| + |x|) − (|y| + 1) < 2|v| − 3

We conclude that the current attempted match has amortized cost at most −2.

Case 3.3.2. We use the notation from the earlier Case 3.3.2. Again, we are only concerned
with the case in which AM1 uses the occurrence shift. The case s1 > r is handled as before.
So suppose that s1 < r. Consider the character, c, of t which was mismatched by attempted
match AM1. The character, c', of p that matches c following the shift of AM1 must be a
character of the second rightmost v of p (for the shift was by distance < |v| and fewer than |v|
characters were matched, so c' must be among the rightmost 2|v| characters of p; yet as c is
to the left of t, c' cannot be among the rightmost |v| characters of p, for p overlaps t by more
than |v| characters following the shift by AM1). But then this is not the rightmost occurrence
of this character in p, and so this shift is not the shift given by the occurrence shift. This is a
contradiction. Hence the case s1 < r does not occur.

We  have  shown: 

Theorem 5 The Boyer-Moore pattern matching algorithm performs at most 3n − n/m comparisons
when matching a non-semi-cyclic pattern of length m against a text of length n.

6      Semi  Cyclic  Patterns 

Now we handle the case of semi-cyclic patterns. So suppose that the pattern, p, is of the form
p = wv^i where v is not cyclic, w is a proper suffix of v, possibly empty, i ≥ 2, and v is the
shortest such string.

We use the pattern p' = uv, where u is the suffix of v of length |v| − 1. Note that p' is not
semi-cyclic. In addition, we record the following index j, 0 ≤ j < i. Let q_j = wv^j. See Figure
12. We record the largest j such that the text contains an instance of the pattern q_j, where
right(q_j) is aligned with right(u).
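
The decomposition p = wv^i and the reduced pattern p' = uv can be illustrated with a short Python sketch. It is not taken from the paper: the function names are hypothetical, strings are 0-indexed, and the smallest period is found by brute force for clarity rather than speed. It relies on the observation that a non-empty pattern is semi-cyclic exactly when its smallest period is at most half its length.

```python
def smallest_period(p: str) -> int:
    """Smallest d >= 1 such that p[q] == p[q - d] for every valid q (brute force)."""
    m = len(p)
    for d in range(1, m + 1):
        if all(p[q] == p[q - d] for q in range(d, m)):
            return d
    raise ValueError("pattern must be non-empty")


def decompose(p: str):
    """Return (w, v, i) with p == w + v * i and i >= 2, or None if p is not semi-cyclic."""
    m = len(p)
    d = smallest_period(p)
    i = m // d
    if i < 2:
        return None          # smallest period exceeds m/2: not semi-cyclic
    v = p[m - d:]            # the last |v| characters of p
    w = p[: m - i * d]       # leftover on the left; a proper suffix of v
    return w, v, i


def reduced_pattern(v: str) -> str:
    """p' = uv, where u is the suffix of v of length |v| - 1."""
    return v[1:] + v
```

For instance, "abababa" decomposes as w = "a", v = "ba", i = 3, and the reduced pattern is p' = "aba".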

Following an attempted match, j is modified as follows. If p' is matched, then the shift
is by distance |v| (by Corollary 1, a shorter shift contradicts the fact that v is not cyclic). If
j = i − 1, then a match of p has been found; j is unchanged. If j < i − 1, then j is incremented.
If p' is not matched, but at least wv is matched, then j is set to 1 (note that the shift is by
distance s = |v| in this case too); while if not even wv is matched, then j is set to 0.
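
The update rule for j can be written as a small state transition. The sketch below is a direct transcription of the rule as stated above; the function name and the two boolean arguments are hypothetical and stand for the outcome of the attempted match against p'.

```python
def update_index(j: int, i: int, matched_p_prime: bool, matched_wv: bool):
    """Return (new_j, found) after one attempted match.

    j               -- current index, 0 <= j < i
    i               -- exponent in p = w v^i
    matched_p_prime -- True if all of p' = uv was matched
    matched_wv      -- True if at least the suffix wv was matched
    found           -- True if an occurrence of the full pattern p is reported
    """
    if matched_p_prime:
        if j == i - 1:
            return j, True       # a match of p has been found; j is unchanged
        return j + 1, False      # one more copy of v has been seen
    if matched_wv:
        return 1, False          # only w followed by a single v is known
    return 0, False              # not even wv was matched
```

With i = 3, three consecutive matches of p' starting from j = 0 take j through 1 and 2 and then report an occurrence of p.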

Figure 12: Semi-cyclic patterns (the text and the pattern p'; if z has suffix w, j = 2,
otherwise j = 1)

The comparison count for the non-semi-cyclic case applies here too, since p' is not
semi-cyclic.

We  conclude: 

Theorem 6 The Boyer-Moore pattern matching algorithm performs at most 3n − n/m comparisons
in matching a pattern of length m against a text of length n.

A      Computing the shift functions

We describe how to compute the shift functions. Descriptions of these procedures have been
given previously, for instance in [KMP77,Ry80]. The following notation will be helpful: p[i,j]
will denote the substring of the pattern extending from the ith to jth characters inclusive and
p(i) will denote the ith character of the pattern.

The occurrence shift is computed by a left to right sweep across the pattern. Suppose the
pattern is indexed from 1 to m, from left to right. Then, for each i, 1 ≤ i ≤ m, we perform the
assignment o-shift(p(i)) := i. This stores, in o-shift(c), the rightmost occurrence of character
c in the pattern. If there is a mismatch at p(j), and the mismatched text character is c, the
occurrence shift is given by j − o-shift(c). To handle the possibility that c does not occur in
the pattern, o-shift(c) should be initialized to 0, prior to the sweep.
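
The sweep can be sketched in a few lines of Python. The function names are hypothetical; a dictionary with a default of 0 stands in for the initialized table, and pattern positions are 1-indexed as in the text.

```python
def rightmost_occurrence(p: str) -> dict:
    """o-shift table: maps each character to its rightmost 1-indexed position in p."""
    o_shift = {}
    for i, ch in enumerate(p, start=1):   # left to right sweep
        o_shift[ch] = i                   # later positions overwrite earlier ones
    return o_shift


def occurrence_shift(o_shift: dict, j: int, c: str) -> int:
    """Occurrence shift for a mismatch at pattern position j against text character c.

    Characters absent from the pattern behave as if o-shift(c) = 0.
    """
    return j - o_shift.get(c, 0)
```

For the pattern "abcab", a mismatch at position 5 against the text character "c" gives a shift of 5 − 3 = 2, and against a character that does not occur in the pattern it gives 5. The value can be zero or negative (for example position 2 against "b" gives −3), in which case the occurrence shift is of no use at that step.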

The computation of the matching shift is less simple. A program is given in Figure 13.
We introduce an auxiliary shift function, called the kmp-shift. It is the shift function used in
some versions of the Knuth-Morris-Pratt algorithm (except that here, we are matching from
the right end of the pattern). Its definition follows. Define l(j) to be the smallest index with
l(j) > j, such that p[l(j),m] is a prefix of p[j,m]; if l(j) is not thereby defined, it is set to
m + 1. Then kmp-shift(j) = l(j) − j. It is straightforward to check that kmp-shift is correctly
computed in Stage 1.
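
The definition of kmp-shift can be checked on small patterns with a brute-force sketch that follows the definition literally. It is not the linear-time computation of Stage 1; the function name is hypothetical, and the returned list holds kmp-shift(1), ..., kmp-shift(m).

```python
def kmp_shift_brute(p: str) -> list:
    """kmp-shift(j) = l(j) - j for j = 1..m, computed directly from the definition."""
    m = len(p)
    shifts = []
    for j in range(1, m + 1):
        suffix_j = p[j - 1:]                  # p[j, m]
        l = m + 1                             # default when no index qualifies
        for cand in range(j + 1, m + 1):
            if suffix_j.startswith(p[cand - 1:]):   # p[cand, m] is a prefix of p[j, m]
                l = cand
                break
        shifts.append(l - j)
    return shifts
```

For example, for the pattern "abbab" the values are 3, 3, 2, 2, 1; each one is the smallest period of the corresponding suffix.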

Next, we explain why the matching shift, m-shift(j), for 1 ≤ j ≤ m, is correctly computed.
There are two possibilities.

Either we need to determine a largest index k < j such that p(j) ≠ p(k) and p[j + 1, m] is
a prefix of p[k + 1, m] (which results in a matching shift of j − k) or we need to find a smallest
j > k such that p[j, m] is a prefix of p (this results in a matching shift of j − 1). These are
computed in Stages 1 and 2, respectively.

Initialize
    1 ≤ j ≤ m: m-shift(j) := m;
    kmp-shift(m) := 1;
Stage 1
    j := m;
    do k = m − 1 to 1 by −1
        while p(j) ≠ p(k) do
            m-shift(j) := min{m-shift(j), j − k};
            if j < m then j := j + kmp-shift(j + 1)
            else exit the while loop
        end;
        if p(j) = p(k)
        then do kmp-shift(k) := j − k; j := j − 1 end
        else (the loop was exited and so j = m)
            kmp-shift(k) := j − k + 1
    end;
Stage 2
    j := j + 1; j-old := 1;
    while j ≤ m do
        for i = j-old to j − 1 do m-shift(i) := min{m-shift(i), j − 1} end;
        j-old := j; j := j + kmp-shift(j)
    end

The program can be optimized, but at some loss in clarity.

Figure 13: Program to Compute the Matching Shift

Let us consider the computation of Stage 1. Basically, it seeks to find longest suffixes,
p[j,m], of the pattern that are proper prefixes of p[k,m] (so j > k), for each k, m > k ≥ 1.
The algorithm proceeds by traversing two instances of the pattern, one indexed by j and one
by k. The invariant is that p[j + 1, m] is a prefix of p[k + 1, m]. If a match is found (p(j) = p(k))
we move one position to the left in both patterns; if no match is found, we slide the pattern
indexed by j to the left by the minimum possible distance (i.e., so that the longest proper suffix
of p[j + 1, m] which matches a prefix of p[j + 1, m] is now aligned with a prefix of p[k + 1, m];
this amounts to the assignment j := j + kmp-shift(j + 1), unless j = m, in which case we
slide the pattern one cell to the left, which leaves j unchanged and decrements k). How could
we fail to compute m-shift(j') = j' − k' for some j'? There are two possibilities. Either k is
decremented to a value smaller than k' while j < j', or j is incremented to a value that ensures
the cell eventually compared to p(k') is to the right of p(j'). The first possibility requires a
match of p(h) and p(k') to be found for some k' < h < j'. But then m-shift(j') ≤ j' − h,
contrary to assumption. The second possibility requires there to be h and l, k' < h < j' and
l ≥ 0, with p(k' + l) and p(h + l) being matched, and kmp-shift(h + l + 1) > j' − h; but this
also contradicts the fact that m-shift(j') = j' − k'. So, in fact, Stage 1 correctly computes all
the matching shifts of the first type.

Let us turn to Stage 2. In order to find m-shift(i), we need to know the least j > i such
that p[j, m] is a prefix of p. But this is exactly what is computed by Stage 2.

It is straightforward to observe that the shift functions can be computed in time O(m).
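
The two stages can be transcribed into Python as follows. This is a sketch and not the author's code: it reads the Stage 2 loop condition as j ≤ m, uses 1-indexed arrays to mirror the program of Figure 13, and all function names are hypothetical. A brute-force routine that follows the two-case definition of the matching shift is included so the transcription can be checked on small patterns.

```python
def matching_shift(p: str) -> list:
    """m-shift(1..m) for a non-empty pattern p, following Stages 1 and 2 of Figure 13."""
    m = len(p)
    P = " " + p                       # pad so that P[1..m] is the pattern
    mshift = [0] + [m] * m            # m-shift(j) := m
    kmp = [0] * (m + 2)
    kmp[m] = 1                        # kmp-shift(m) := 1
    # Stage 1: shifts of the first type (an earlier, differently preceded occurrence)
    j = m
    for k in range(m - 1, 0, -1):
        while P[j] != P[k]:
            mshift[j] = min(mshift[j], j - k)
            if j < m:
                j += kmp[j + 1]       # slide to the next shorter candidate
            else:
                break
        if P[j] == P[k]:
            kmp[k] = j - k
            j -= 1
        else:                         # the loop was exited with j == m
            kmp[k] = j - k + 1
    # Stage 2: shifts of the second type (a suffix of p that is a prefix of p)
    j += 1
    j_old = 1
    while j <= m:
        for i in range(j_old, j):
            mshift[i] = min(mshift[i], j - 1)
        j_old = j
        j += kmp[j]
    return mshift[1:]


def matching_shift_brute(p: str) -> list:
    """Smallest shift d consistent with the matched suffix and the mismatched character."""
    m = len(p)
    P = " " + p
    result = []
    for j in range(1, m + 1):         # mismatch at p(j); p[j+1, m] was matched
        for d in range(1, m + 1):     # d == m always qualifies
            suffix_ok = all(q - d < 1 or P[q - d] == P[q] for q in range(j + 1, m + 1))
            char_ok = j - d < 1 or P[j - d] != P[j]
            if suffix_ok and char_ok:
                result.append(d)
                break
    return result
```

For the pattern "abbab" both routines give the shifts 3, 3, 3, 2, 1 for mismatches at positions 1 through 5.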